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The covariance matrix of a p-dimensional random variable is a fundamental 
quantity in data analysis. Given n i.i.d. observations, it is typically estimated 
by the sample covariance matrix, at a computational cost of 0(np 2 ) oper¬ 
ations. When n,p are large, this computation may be prohibitively slow. 
Moreover, in several contemporary applications, the population matrix is 
approximately sparse, and only its few large entries are of interest. This 
raises the following question, at the focus of our work: Assuming approxi¬ 
mate sparsity of the covariance matrix, can its large entries be detected much 
faster, say in sub-quadratic time, without explicitly computing all its p 2 en¬ 
tries? In this paper, we present and theoretically analyze two randomized 
algorithms that detect the large entries of an approximately sparse sample 
covariance matrix using only 0{np poly log p) operations. Furthermore, as¬ 
suming sparsity of the population matrix, we derive sufficient conditions on 
the underlying random variable and on the number of samples n, for the 
sample covariance matrix to satisfy our approximate sparsity requirements. 
Finally, we illustrate the performance of our algorithms via several simula¬ 
tions. sparse covariance matrix, sub-quadratic time complexity, multi-scale 
group testing. 


1. Introduction 

Let Z = (Z\,..., Z p ) be a p-dimensional real valued random variable with a p x p 
covariance matrix E. Given n i.i.d. samples of Z, denoted {zj}™ =1 , a fundamental task in 
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statistical inference is to estimate E. A standard estimator of E is the sample covariance 
matrix S = ^T(zj — z)(zj — z) T , where z is the sample mean. Direct computation 
of all p 2 entries of S requires 0{np 2 ) operations. In various contemporary data analysis 
applications, where both p and n are large, this computation may be prohibitively slow 
and challenging in terms of memory and storage. 

In several applications, however, the population covariance matrix is approximately 
sparse, whereby only its few large entries are of interest, and the remaining entries are 
either small or even precisely equal to zero. Applications leading to sparse covariance 
matrices include, among others, gene arrays and biological networks [lj, social networks, 
climate data and f-MRI scans. As we are only interested in the large entries, a key 
question is whether these can be computed significantly faster, possibly by cleverly 
detecting their locations, without directly computing the entire matrix. 

In this paper we present and theoretically analyze two different randomized algorithms 
with sub-quadratic time complexity, to detect and subsequently compute the large entries 
of a sparse sample covariance matrix. First, in Section [21 we present a reduction of 
this problem to the sparse-Fast-Fourier-Transform (sFFT) [2]. A solution to our task 
then directly follows by invoking multiple calls to the recently developed randomized 
sFFT sub-linear time algorithm. Next, in Section [I] we present a simpler and more 
direct algorithm, based on the construction of O(logp) binary random trees. We prove 
that under suitable assumptions on the sparsity of the matrix S, both algorithms are 
guaranteed, with high probability, to locate its large entries. Furthermore, their runtimes 
are 0(nrp log 3 p) operations for the sFFT-based algorithm, and 0(nrp log 2 p) operations 
for the tree-based method, where r is a bound on the number of large entries in each 
row of S. By suitable normalization of the input data, both algorithms can also detect 
large entries of a sparse sample correlation matrix, whose entries are , 13 . 

y SuSjj 

The theoretical analysis of the two algorithms of Sections [3] and 0] relies on the as¬ 
sumption that S is approximately sparse. In reality, in various applications one may only 
assume that the population matrix is approximately sparse. To this end, in Section [5] we 
provide sufficient conditions on E, the underlying random variable Z and the number of 
samples n that ensure, w.h.p., the approximate sparsity of S. 

Finally, in Section 0] we empirically compare the tree-based algorithm with other 
methods to detect large entries of S. In addition, we illustrate on artificially generated 
data that includes near duplicates the potential applicability of our algorithm for the 
Near Duplicate Detection problem, common in document corpora analysis [3] . 

1.1. Related works 

In the statistics literature, the problem of sparse covariance estimation has received 
significant attention, see for example ® E, E 2 B3- As discussed in Sections [3] and 0] 
below, under suitable sparsity assumptions, our algorithms are guaranteed to find with 
high probability all entries of the sample covariance matrix which are larger, in absolute 
value, than some threshold p. The resulting matrix is then nothing but the sample 
covariance matrix, hard-thresholded at the value p. The statistical properties of such 
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thresholding were intensively studied, see for example mmm and references therein. 
The main motivation of these works was to derive a more accurate estimate of the 
population covariance matrix £, assuming it is sparse, thus overcoming the asymptotic 
inconsistency of S in the operator norm, in the joint limit p, n oo with p/n —>• c, see 
for example |9j. Moreover, hard-thresholding with a threshold that slowly tends to zero 
as p, n —» oo, was proven to be asymptotically minimax optimal under various sparsity 
models. Our work, in contrast, is concerned with the computational effort of computing 
this thresholded estimator, mainly for finite p, n and relatively large thresholds. 

Focusing on the computational aspects, first of all note that the sample covariance 
matrix S can be represented as a product of two suitable matrices. Hence, using fast 
matrix multiplication methods, all its entries can be computed faster than 0(np 2 ). Cur¬ 
rently, the fastest matrix multiplication algorithm for square matrices of size N x N 
has a time-complexity of 0(IV 2 ' 3727 ) |10j . Hence, by expanding (with zero padding) the 
input sample matrix to a square matrix, all entries of S can be exactly evaluated using 
0(max{n,p} 2 ' 3727 ) operations. 

In recent years, several works developed fast methods to approximate the product of 
two matrices, see for example DU, as well as [12] and [I3j which assume that the product 
is approximately sparse. In particular, in a setting similar to the one considered in this 
paper, the method of m, based on fast approximate matrix multiplication, can detect 
the large entries of a sparse covariance matrix in time complexity comparable to ours. 

In addition, since the entries of S can be represented as inner products of all-pairs 
vectors, our problem is directly related to the Maximum Inner Product Search (MIPS) 
problem pang. In the MIPS problem, given a large dataset {xj}, the goal is to quickly 
find, for any query vector y, its maximal inner product maxj(y,Xj). [T3] presented 
three algorithms to retrieve the maximal inner product, which can be generalized to find 
the k largest values. While [14] did not provide a theoretical analysis of the runtime of 
their algorithms, empirically on several datasets they were significantly faster than direct 
computation of all inner products. Recently, m presented a simple reduction from the 
approximate -MIPS problem, of finding maxj(y, x*) up to a small distortion e, to the 
well-studied k nearest neighbour problem (&NN). This allows to solve the approximate- 
MIPS problem using any fcNN procedure. The (exact or approximate) A:NN search 
problem has been extensively studied in the last decades, with several fast algorithms, see 
ns El OBI and references therein. For example, kd -tree [19] is a popular exact algorithm 
for low dimensional data, whereas Local-sensitive-hashing (LSH) approximation methods 
[20] are more suitable to high dimensions. Combining the reduction of m with the 
LSH-based algorithm of [21] yields an approximate solution of the MIPS problem with 
0 (np 7 poly log p) query time, where 7 € (0,1) controls the quality of the approximation. 
These methods can be exploited to approximate the sample covariance matrix, perhaps 
with no assumptions on the matrix but with slower (or no) runtime guarantees. In 
section [6] we empirically compare our sub-quadratic tree-based algorithm to some of the 
above methods. 

Another line of work related to fast estimation of a sparse covariance matrix is the 
matrix sketching problem [22]. Here, the goal is to recover an unknown matrix X, given 
only partial knowledge of it, in the form of a linear projection AXB with known matrices 
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A and B. The matrix sketching problem is more challenging, since we are given only 
partial access to the input. In fact, recent solutions to this problem [23 J have time 
complexity at least 0(np 2 ). 

A more closely related problem is the all-pairs similarity search with respect to the 
cosine similarity or the Pearson-correlation. Here, given a set of vectors {xj}, the task 
is to find the pair with highest cosine similarity, max^ x/xj/||x,;||||xj||. Two popular 
exact methods, suitable mainly for sparse inputs, are All-Pairs [23] and PPJoin+ [3]. 
As in the nearest-neighbours search problem, hashing can be adapted to obtain various 
approximation algorithms (25]. These algorithms can be used to rapidly compute a 
sparse correlation matrix. In a special case of the all-pairs similarity search, known 
as the Light Bulb Problem [26], one is given p boolean vectors all of length n and 
taking values ±1. The assumption is that all vectors are uniformly distributed on the 
n-dimensional boolean hypercube, apart from a pair of two vectors which have a Pearson- 
correlation p ydog p/n. The question is how fast can one detect this pair of correlated 
vectors. LSH type methods as well as bucketing coding m solve this problem with a 
sub-quadratic time complexity of 0{np 9 ^), for a suitable function g(p) of the correlation 
coefficient. More recently, [28J developed a method with expected sub-quadratic time 
complexity of 0(np ie2 ) operations, where the exponent value 1.62 is independent of 
p and directly related to the complexity of the fastest known matrix multiplication 
algorithm. It is easy to show that the methods proposed in our paper can solve the 
Light Bulb Problem using only 0(nppoly logp) operations, but assuming a significantly 

stronger correlation of p > Furthermore, our methods offer an interesting 

tradeoff between time complexity and correlation strength p. For example, our tree- 
based algorithm with 0(p a ) trees (instead of O(logp)), can detect weaker correlations 

of strength p > 0(^J^~p—) with a time complexity of 0(np 1+a poly log p) operations. 

2. Notation and Problem Setup 

For simplicity, in this paper we assume the input data (z \,..., z n ) is real-valued, though 
the proposed methods can be easily modified to the complex valued case. For a vector 
x £ M n , we denote its z-th entry by Xj (or (x)j), its L 2 -norm by ||x|| := yjY^i=i x z > and 
its Lo-norm by ||x| |o := |{z : x, ^ 0}|. Similarly, for a matrix A £ M pxp we denote its k- th 
row by A k , its th entry by Aij (or (A^j), its L 2 -norm by ||A|| := sup x _^ 0 and 

its Frobenius norm by ||A||i? := A 2 -. For an integer a £ N, let [a] := {1, 2,..., a}. 

To simplify notations, the inner product between two complex vectors x, y £ C n is 
defined with a normalization factor, 

1 71 
i—1 

where f represents the complex conjugate, and will be relevant only when we discuss the 
Fourier transform which involves complex valued numbers. 
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Given the input data {zi,... , z n }, let x$ = ((zi)j — (zj), • • • , (z n )j — (z n )) € M n be 
a mean centered vector of observations for the i-th coordinate of Z. This leads to the 
simple representation of S, in terms of inner products 

Sij = (xj,Xj), 1 < i,j < p. (1) 

For a matrix A € M pxp and a threshold parameter p, we define the set of large entries 
at level // in the /c-th row to be 

= {j ■ |^4.fcj| A 1 } 

and the set of all its large entries at level p by 

p 

MA ) = 

fc=i 

As for the definition of matrix sparsity, we say that a matrix A is (r, //) -sparse if for 
every row /c, |J /i (A/ c )| < r. We say that A is (r, p, i?, g)-sparse if it is (r, /r)-sparse and 
for every k € [p] the remaining small entries in the /c-th row are contained in a L g -ball 
of radius R, 

( E i^i 9 ) 7 <*• 

\j<£JA A k) / 

Here we only consider the case where q = 2 and i? < /i/2. 

Problem Setup. Let {zi,...,z n } be n input vectors whose covariance matrix S is 
(r,/i)-sparse, with r <C p. In this paper we consider the following task: Given {z i}f =1 
and the threshold value /i, find the set which consists of all entries of S which 

are larger than p, in absolute value. A naive approach is to explicitly compute all p 2 
entries of S and then threshold at level p. This requires 0(np 2 ) operations and may be 
quite slow when p> 1. In contrast, if an oracle gave us the precise locations of all large 
entries, computing them directly would require only 0(npr ) operations. 

The key question studied in this paper is whether we can find the set J^(S), and 
compute the corresponding entries significantly faster than 0(np 2 ). In what follows, we 
present and analyze two sub-quadratic algorithms to discover J M (5), under the assump¬ 
tion that the matrix S is approximately sparse. 

3. Sparse covariance estimation via sFFT 

The first solution we present is based on a reduction of our problem to that of multiple 
independent instances of sparse Fourier transform calculations. Whereas standard fast- 
Fourier-transform (FFT) of a vector of length p requires 0(p\ogp) operations, if the 
result is a-priori known to be approximately sparse, it can be computed in sub-linear 
time. This problem, known as sparse-FFT (sFFT), has been intensively studied in the 
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past few years, see PMEDIEI]. As described below, the computational gains for our 
problem then directly follow by an application of one of the available sFFT algorithms. 

In what follows we present this reduction, and focusing on the algorithm of [2] we 
analyze under which sparsity conditions on S it is guaranteed to succeed w.h.p. To this 
end, we use the following definitions for the discrete Fourier transform (DFT) T : C p —>• 
C p and its inverse J 7-1 , 

(^M)j = J2 (- F_1 [ x ])i = ~ x ^ 

1=1 p 1=1 

where u: = exp(27ri/p), i = y/—l. Without any assumptions on the input x, the fastest 
known method to compute its DFT is the FFT which requires 0{p\ogp ) operations. 
If the vector -F[x] is approximately r-sparse, with only r large entries, it is possible 
to estimate it in sub-linear time, by detecting and approximately evaluating only its 
large entries. In particular, [2] developed a randomized algorithm to compute the sparse 
Fourier transform with time complexity of 0(r\ogp\og{p/r)), which we refer to here as 
the sFFT algorithm. Formally, for any input vector x G C p and parameters a, 6, r, the 
sFFT algorithm returns a sparse vector x such that, w.p. > 2/3 

||-F[x] — x|| < (1 + a) min ||.F[x] - y|| + <5||.F[x]||. (2) 

l|y||o<r 

The algorithm time complexity is 0(^log(p/r)log(p/5)), though for our purposes we 
assume that a is fixed (e.g. a = 1) and hence ignore the dependence on it. The output 
x of the sFFT algorithm is represented as a pair (J, y) where J C {1,... ,p} is a set of 
indices and y G are the values of x at these indices, namely x| j = y and x|jc = 0. 

We now present the reduction from the sparse covariance estimation problem to sparse 
FFT. To this end, let us define a matrix W = (wj,..., w p ) G c nx P whose j’-th column 
is w j = | Y^=\ . Note that using standard FFT methods, W can be calculated in 

0{np\ogp) operations. The following lemma describes the relation between the matrix 
S and the matrix W. 

Lemma 1. For any k G \p\, let = ((x*,, w/,..., (x*,, w p )) G C p . Then, 

•F[u k ] = S k . (3) 

Proof. Eq. Q follows directly by applying the inverse DFT on S k , 

i p i p 

= -^(x fc ,Xi)w jZ = (x fc , - ^x/w - - 7 '} = (x fc ,Wj) = (u k )j. 
p i=i p i=i 

□ 

According to Lemma [fl each row S k of S is the DFT of an appropriate vector u^. 
Since by assumption S k is approximately sparse, we may thus find its set of large indices 
in sublinear-time by applying the sFFT algorithm on the input u^.. We then explicitly 
compute the corresponding entries in S k directly from the original data xi,..., x p . 
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Computing all p entries of all p vectors {u^} requires a total of 0{np 2 ) operations. 
However, with this time complexity we could have computed all entries of the ma¬ 
trix S to begin with. The key point that makes this reduction applicable is that 
the sFFT algorithm is sub-linear in time and to compute its output it reads at most 
0(rlog(p/6)log(p/r)) coordinates of u^,. In particular, it does not require a-priori eval¬ 
uation of all p entries of each vector u^.. Hence, we may compute on-denrand only the 
entries of requested by the algorithm. Since computing a single entry of can be 
done in 0(n ) operations, the total number of operations required for a single sFFT run 
is 0(nrlog(p/5) log (p/r)). 

To detect the large entries in all rows of S, all p outputs of the sFFT algorithm should 
simultaneously satisfy Eq. (J2]) with high probability. Given that each sFFT run succeeds 
w.p. > 2/3, we show below that it suffices to invoke m = O(logp) independent queries 
of the sFFT algorithm on each input u^. The output for each row is the union of the 
large indices found by all sFFT runs. 

In more detail, for each row k let (Ji,yi),..., be the outputs (indices and 

values) of sFFT on m independent runs with the same input u^.. Our approximation for 
the k- th row of S is then 



Sfcj j € U j 
0 otherwise 


(4) 


The following lemma and corollary prove that m = O(logp) runs suffice to detect all 
large entries of S with a constant success probability. 


Lemma 2. Let m = [log(3p)/log(3)] = 0(log(p)). Then, for each k € [p] the following 
inequality holds w.p. > 1 — 


IIS* - S k \\< (1 + a) min ||S* - y|| + <5||5 fc ||. 

l|y||o<r 


Corollary 1. Letm= [log(3p)/log(3)] as in the previous lemma. Then with probability 
> 2/3, simultaneously for all rows k E \p\ 

IIS* — S k \\< (1 + a) min ||S* - y|| + 6\\S k \\. (5) 

y:||y||o<r 

The proof of corollary Q] follows from a standard union-bound argument, details omit¬ 
ted. 

To conclude, we use multiple runs of the sFFT algorithm to find a set I of indices in S, 
whose entries S/j may be potentially large, and which we then evaluate explicitly. The 
resulting approximation of S is compactly represented as This pro¬ 

cedure is summarized in Algorithm [T| The following Theorem, proven in the appendix, 
provides a bound on its runtime and a guarantee for the accuracy of its output. 

Theorem 1. Assume S is (r, p, R, 2) -sparse, where p > 2 R + e for known R, e > 0, 
and let M = max, Sa. Then, Algorithm [7J which invokes the sFFT algorithm with 
parameters 5 = R+ ^ M and a = 1, has a runtime of 0(nrp log 2 plog((f? + y/rM)p/ e)) 
operations, and w.p. > 2/3, its output set I is guaranteed to include the set J^iS) of all 
large entries of S. 
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Algorithm 1 sFFTCovEstimation(xi,..., x p , r, R, e) 

Input: 

(xi,..., x p ): p vectors of dimension n. 

r: bound on the number of large entries in each row. 

R, e: sparsity parameters. 

Output: S: Compact representation of the large entries of S. 

1 : Compute W = (wi,..., w p ) € C nxp using FFT, where w j = ^ Y^!i=i x / w_:?7 
2 : Set 7 = 0 

3: Compute all diagonal entries Su and the value M = maxj Su 
4: for j = 1 ,p do 

5: Calculate (Ji, yi),..., (,7 m , y m ) by running m = O(logp) sFFT queries with input 

U k = ((Xfe, Wj),..., (x fc , w P )),S= R+ ^ M and a = 1 

6 : Add {k} X ((Xi Ji) to I 

7: end for 
8 : for (i. j) G I do 
9: Calculate Sij = (x,;. x^) 

10 : end for 

11 : return S = {((i,j), Sij)}^ eI 


4. Tree-based Algorithm 

We now present a second more direct method to efficiently detect and compute the large 
entries of a sparse covariance matrix. This method, which assumes the threshold p is 
a priori known, is based on a bottom-up construction of m random binary trees. To 
construct the Z-th tree, we first place at its p leaves the following n dimensional vectors 
{% x i}i=n where Vlj 1 N( 0,1). Then, at higher levels, the n-dimensional vector in 
each parent node is the sum of its offsprings. After the construction of the trees, the main 
idea of the algorithm is to make recursive coarse-to-fine statistical group tests, where 
all indices of various subsets A C [p] are simultaneously tested whether they contain at 
least one large entry of S or not. 

In more details, given as input a row k G \p\ and a set A C [p], we consider the 
following query or hypothesis testing problem: 

R 0 : J t t(S k )nA = tt vs. Ux : D A + 0. 

Assuming S is (r, p, R, 2)-sparse, with R < p/2, one way to resolve this query is by 
computing the following quantity 

F(k,A) = ^(x fc , Xj ) 2 . ( 6 ) 

j&A 

Indeed, under Hq, F(k,A ) < R 2 , whereas under T-L\, F(k,A) > p 2 . 

A direct calculation of (JUJ) requires 0{n\A\) operations, which may reach up to 0(np) 
when \A\ is large. Instead, the algorithm estimates the value of F(k,A ) in Eq. (JHJ) using 
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m i.i.d. samples {yi\YLi °f the random variable 

Y(A) = <x fej 

j£A 

where ( 771 ,..., p p ) ~ N( 0, I p ). Since by definition E Y (, A ) = 0 and Var[Y (M)] = F(k,A ), 
Eq. (J 6 |) can be estimated by the empirical variance of the m variables {yi\YL\- Again, 
given a specific realization of r/j’s, direct calculation of Ecp (0 also requires 0{n\A\) 
operations. Here our construction of m binary trees comes into play, as it efficiently 
provides us samples of yL j r lji f° r an Y subset of the form A = {2 h i ,..., 2 h (i + 1 )}, 

where h € [log 2 p] and i € \p/2 h }. As described in section H~ 2 l below, after the pre¬ 
processing stage of constructing the m trees, each query requires only 0(nm ) operations. 
Moreover, we show that m = O(logp) trees suffice for our purpose, leading to 0(n log p) 
query time. 

To find large entries in the k-th row, a divide and conquer method is applied: start 
with A = { 1 ,... ,p}, and check if F(k,A) of Eq. 0 is large by invoking an appropriate 
query. If so, divide A into two disjoint sets and continue recursively. With this approach 
we reduce the number of required operations for each row from 0(np ) to O (nr log 2 p), 
leading to an overall time complexity of O (nrp log 2 p). 

4.1. Preprocessing Stage - Constructing the trees 

To efficiently construct the rn trees, first mp i.i.d. samples {r/ij} are generated from a 
Gaussian distribution IV(0,1). For simplicity, we assume that p = 2 L , for some integer 
L, leading to a full binary tree of height L + 1. Starting at the bottom, the value at the 
j-th leaf in the l -th tree is set to be n-dimensional vector Then, in a bottom-up 

construction, the vector of each node is the sum of its two offsprings. Since, each tree 
has 2p — 1 nodes, and calculating the vector of each node requires 0(n) operations, the 
construction of m i.i.d. random trees requires 0(nmp ) operations. 

For future use, we introduce the following notation: for a given tree T, we denote 
by T(h,i ) the i-th node at the h -th level, where the root is considered to be at level 
zero. Furthermore, we denote by I(h,i ) the set of indices corresponding to the leaves of 
the subtree starting from T(h,i ). The vector stored at the node T(h,i) is the following 
n-dimensional random variable 

Val(T(h,i))= r /? x j 

jel(h,i) 

whose randomness comes only from the random variables yj (we here consider the sam¬ 
ples {xj} as fixed). The entire tree construction procedure is described in Algorithm 

El 

4.2. Algorithm description 

As mentioned before, we assume that the threshold p is an input parameter to the 
algorithm. Given a set A = I(h,i ) and a vector X&, to estimate 0 the algorithm uses 
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Algorithm 2 ConstructTrees(xi,..., x p , m) 

Input: 

(xi,..., x p ): p vectors of dimension n. 
m: number of trees. 

Output: mn random binary trees (Tf,..., T m ). 

1: Generate rjij. for j € \p\ and l € [m], where r)ij 1 L JV(0,1) 

2: for l = 1,..., m do 

3: for j = 1,... ,p do 

4: Set Val(Ti(L,j )) = rjijitj 

5: end for 

6: for h = L — 1,..., 0 do 

7: for z = 1,..., 2 h do 

8: Set Val(Ti(h, i)) = Val(Ti(h + 1, 2i)) + Val(Ti(h + 1, 2i + 1)) 

9: end for 

10: end for 

11: end for 


m i.i.d. samples of (0 obtained from the relevant node T(h,i) of the tree. For this, let 
yi,.... y m be the m i.i.d. samples of Y = Y(I(h,i )), 

yi = (x k ,Val(Ti(h,i))) = ^ (x. k ,x. j )rn j . 

jel{h,i) 


Since EY = 0 and 

a 2 ■= Var[Y) = ^ (^k,^j) 2 

jel(h,i) 

a natural estimator for 0 is the sample variance 
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Algorithm 3 Find(x, h, i, m, {?}}, (i) 

Input: 

x: input vector of dimension n. 

( h,i ): tree-node index. 
m: number of trees. 

{Ti}: a collection of m trees. 

H : threshold parameter. 

Output: The set of large entries in the current sub-tree. 

1: if h = L then 
2: return {«} 

3: end if 

4: Set a = £iY l Z:i( x , Val ( T l( h + 1 , 2i ))) 2 : b = (x,Val(Ti(h + l,2i + l))) 2 

5: Set S a = 0, Sb = 0 

6 : if a > then 

7: S a = Find(x, h + 1, 2i, m, {T/},/r) 

8: end if 

9: if b > then 

10: Sb = Find(x, h + 1, 2i + 1, m, {T/}, fi) 

11: end if 

12: return S a U Sb 


Recall that for a matrix S which is 2)-sparse with R < \xf 2, the exact value of 

Eq. © allows us to perfectly distinguish between the following two hypotheses 


:/(M) n = 0 vs. u x :J(M)n MS k )^ 
Given the estimate a 2 , the algorithm considers the following test 

■ Y 3^ 2 


Ho 


( 8 ) 

(9) 


If d 2 > the algorithm continues recursively in the corresponding subtree. Lemma 
[3] below shows that m = 0 (log trees suffice to correctly distinguish between the two 
hypotheses w.p. at least 1 — 5. 

In summary, the algorithm detects large entries in the k -th row by processing the m 
random trees with the vector x^; starting from the root, it checks if one of its children 
contains a large entry using the query presented in Eq. ©, and continues recursively 
as required. This recursive procedure is described in Algorithm [3) Then, as described 
in Algorithm [I] the complete algorithm to detect all large entries of a sparse covariance 
matrix applies Algorithm 3 separately to each row. As in Algorithm [H the output of 
Algorithm 0] is a compact representation of the large entries of the matrix, as a set of 
indices / C [p] x [p] and their corresponding entries {£V,'}( 
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Algorithm 4 SparseCovTree(xi,..., x p , m, /i) 

Input: 

(xi,..., x p ): p vectors of dimension n. 
m: number of trees. 
p: threshold parameter. 

Output: S: compact representation of large entries of S. 

1 : Construct m trees: {7}} = ConstructTrees(xi,..., x p ,m) 

2 : Set 7 = 0 

3: for k = 1,... ,p do 

4: Add {k} x Find(xfc, 0,1, m, {7]}, p) to I 

5: end for 

6: for (i,j) € 7 do 

7: Calculate Sij = (x,;. Xj) 

8: end for 

9: return {((*, j), *S' ij )}(j J ) e7 


4.3. Theoretical analysis of the Tree-Based algorithm 

Two key questions related to Algorithm [I] are: i) can it detect w.h.p. large entries 
of S] and ii) what is its runtime. In this section we study these questions under the 
assumption that the matrix S is approximately sparse. We start our analysis with the 
following lemma, proven in the appendix, which shows that m = O(log^) trees suffice 
for the test in Eq. ([9]) to succeed w.p. at least 1 — 5. 

Lemma 3. Assume S is (r, fi, R, 2) -sparse, where R < /i/2. Then, for m > 641og(|), 


Pr[/aZse alarm] = Pr 
Pr [mis detection] = Pr 




~2 



< 6 



< 5 


Remark 1. While the constant of 64 is not sharp, the logarithmic dependence on 5 is. 

Remark 2. The choice of g to be Gaussian is rather arbitrary. Standard concentration 
inequalities \32\j imply that every sub-Gaussian distribution with zero mean and variance 
1 will yield similar results, albeit with different constants. 

Next, assuming S is approximately sparse, Theorem [2] below shows that w.h.p., Algo¬ 
rithm 3] indeed succeeds in finding all large entries of S. Most importantly, its runtime 
is bounded by 0(nrp log 2 p) operations, both w.h.p. and in expectation. 

Theorem 2. Assume S is (r, p, R, 2) -sparse, where R < /i/2. Let I(m ) be the set of 
indices returned by Algorithm [/] with threshold /i and m trees. For a suitably chosen 
constant C that is independent of n, m, r and p, define 


f(p,m ) = Pr [J p (5) C I(m) and Runtime < Cnpr log 2 /?] . 
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Then, for m(p ) = |~641og(2rpL 3 )], where L = log 2 p + 1, 

(%) For all p> 8, f(p , m(jp )) > 2/3. 

/m/ f{p,m(p )) 1. 

(m/ E[iJtmi*me] < C'nrplog 2 p, for some absolute constant C'. 

Remark 3. The sparsity of S is required only to bound the false alarm probability in 
Lemma 0 If S is not necessarily sparse, the algorithm will still locate w.h.p. all of 
its large entries. However, its runtime can increase significantly, up to 0(np 2 log 2 p) 
operations in the worst case when r = p. 

Remark 4. Note that since our tree-based algorithm analyzes each row of S separately, 
in fact S need not be globally (r, p, R, 2) -sparse. Our algorithm is still applicable to a 
matrix S with k non-sparse rows and all other rows (r, p, R, 2) -sparse. On the non- 
sparse rows our algorithm will be quite slow, and would analyze all the leaves of the tree. 
However, on the sparse rows, it would detect the large entries in time 0(npr log 2 p). If 
k = O(logp) the overall run time is, up to logarithmic factors in p, still 0(npr). 

The runtime guarantees of Theorem [4] are probabilistic ones, where the algorithm 
runtime can reach up to 0(np 2 log 2 p) operations, even when S is sparse. For an a-priori 
known upper bound on the sparsity parameter r, one can slightly modify the algorithm 
to stop after 0{nrp log 2 p) operations. This modification may decrease the probability to 
detect all large entries, but still maintain it to be at least 2/3. This is true since the event 
of finding all large entries contains the event of finding all large entries using a bounded 
number of operations, whereas the second event is not affected by the modification. 

4.4. Comparison between the sFFT-based and tree-based algorithms 

Tabic [T] compares the main properties of the sFFT-based and the tree-based algorithms. 
Interestingly, even though the two algorithms are very different, their success relies on 
precisely the same definition of matrix sparsity. Moreover, the difference in the input 
parameters is not significant since knowledge of R yields a lower bound for p, and 
vice versa. Although the tree-based algorithm has a lower runtime complexity and is 
easier to understand and implement, the sFFT-based algorithm is still of interest as it 
illustrates a connection between two different problems which seem unrelated at first 
glimpse; estimating the sparse Fourier transform of a given signal and detecting large 
entries of a sparse covariance matrix. Hence advances in sFFT can translate to faster 
sparse covariance estimation. 
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sFFT-based 

SparseCov Tree 

Sparsity assumptions 

(r, p. R, 2)-sparse with g > 2R + e 

(r, y, R, 2)-sparse with p, > 2R 

Runtime bound 

O (nrp log'* plog((I? + y/rM)p/e) 

O (nrp log 2 p) 

Probability to detect all large entries 

> 2/3 

> 2/3 

Required input parameters 

r,R,e and M = max{Si,;} 

M only 

Dependencies on other algorithms 

Based on the sFFT algorithm 

Standalone 


Table 1: Comparison between the sFFT-based and Tree-based algorithms. 


5. Relation between sample size, sparsity of S and of X 

The theoretical analysis in the previous two sections assumed that the sample covariance 
matrix S is approximately sparse. Typically, however, one may only assume that the 
population matrix £ is sparse, whereas S is computed from a sample of n observations 
zi,..., z n . In general, this may lead to a non-sparse matrix S, even when £ is sparse. 
Nonetheless, we show below that if the underlying r.v. Z is sub-gaussian with a sparse 
covariance matrix £, then given a sufficient number of samples n = 0(p\ogp), w.h.p. 
the corresponding sample covariance matrix S is also approximately sparse, albeit with 
different parameters. 

To this end, recall that a random variable Y is said to be sub-gaussian if 

\\Y\\ 4 , 2 :=supg- 1 / 2 (E|y|'?) 1 /'? <oo 

<7>1 

and similarly, a random variable Y is said to be sub-exponential if 

imk := sup(/-'(ElTl 9 ) 1 / 9 < oo. 
q> i 

See for example [32]. Last, a random vector Z = (Z i,..., Z p ) is said to be sub-gaussian 
if Zi is sub-gaussian for every j G [p]. 

The following theorem provides us sufficient conditions on £, the underlying random 
variable Z and the number of samples n to ensure, w.h.p., the approximate sparsity of 
5 . 

Theorem 3. Assume Z is a sub-gaussian random vector with covariance matrix £ which 
is (r, fi , R, 2 )-sparse where 2 R < /a. Let K = max* | \Zi\\^ 2 and t = min{7^=2—, K 2 }- 

Then, for n > C^r log(54p 2 ), where C is an absolute constant, w.p. > 2/3, S is 
(r,fi — t,^(p — t),2)-sparse. Moreover, with high probability, every large entry of £ 
(w.r.t. fi) is a large entry of S (w.r.t. p — t), and vice versa. 

In addition to the probabilistic guarantees of Theorem [3] for fixed p and n, we can also 
deduce stronger asymptotical results, as n,p-> oo. 
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Theorem 4. Assume Z is a sub-gaussian random vector with covariance matrix E which 
is (r, n, R , 2)-sparse where 2R < p, and let t be as in Theorem 0 Then, as n —>• oo with p 
fixed, or as n,p —>• oo with pl ° gp —>• 0, f/ie probability that S is ( r,p — t , |(/x — i), 2 ) -sparse 
with = Jfi(T) converges to one. 

Note that for large p>l, the parameter t in Theorem[3]is equal to (p—2R) /(2-^/p — r + 1) 
Hence, the required number of samples for S to be approximately sparse is n > Oi\og{p)/t 2 ) 
0{p\ogp). With such a number of samples, we can replace the assumptions on S with 
corresponding assumptions on E and obtain the following result analogous to Theorem 

E 

Corollary 2. Assume E is (r, p, R, 2) -sparse, where R < p/2. Then, as n,p —>• oo with 
pl ° sp —> 0, with probability tending to 1, Algorithm^ finds all large entries of S, which 
correspond to large entries of E, using at most 0{nrp log 2 p) operations. 

In various contemporary statistical applications, the number of samples n is compa¬ 
rable to the dimension p. In such a case, we may still detect the large entries of the 
population covariance matrix in sub-quadratic time. For example, we can divide the p 
variables into K = p 1-Q distinct groups, each of size p a , and separately find the largest 
entries in each of the I\ 2 sub-matrices of the full covariance matrix. In each pair, Corol¬ 
lary 5.1 applies, since if p, n —>• oo with p/n —> const, then p° log p/n —>• 0. The overall 
run time, up to logarithmic factors in p is now higher 0(np 2 ~ a r) but still sub-quadratic. 

Finally, note that throughout this section we assumed that the underlying random 
variable Z is sub-Gaussian. It is an interesting question whether some of the above 
theorems continue to hold under weaker tail conditions. 


6. Simulations 

In this section we illustrate the empirical performance of Algorithm 4, denoted SparseC- 
ovTree, on input drawn from a Gaussian distribution with a sparse covariance matrix. 
Furthermore, we compare it to other algorithms that can be used to locate the large 
entries of S: (1) LSH-/cNN [20]; (2) kd -tree [19] : (3) Dual-Ball-Ball (DBB) |14| : (4) 
Dual-Ball-Cone (DBC) [330; and (5) direct calculation of all entries of S, at a cost of 
0{np 2 ) operations. To the best of our knowledge, no public implementation of the sFFT 
algorithm of [ 2 ] is currently publicly available, thus we did not include it in the compari¬ 
son. For the LSH-£;NN and kd -tree algorithms, we used the mlpack library |33], whereas 
for DBB and DBC algorithms we used the implementation kindly provided to us by [14] . 
All codes are in C++ and were compiled with the 03 optimization flag. 

We generated random population covariance matrices E as follows: first r = [ log3P j 
entries in each row (and their corresponding transposed entries) were selected uniformly 
to be ±1, with all others set to zero. Then, the diagonal entries were set to be ±1 as 
well. Last, to have a valid positive-definite covariance matrix, all diagonal entries were 
increased by the absolute value of the smallest eigenvalue of the resulting symmetric 
matrix plus one. Following this, we chose the threshold to be p = 0.5. 


15 







Figure 2: Average runtime to detect large entries of a sparse covariance matrix as a 
function of the dimension for several algorithms, presented in a logartihmic 
scale in both axes. The input data was drawn from a Gaussian distribution 
with a random population covariance matrix and r = [ lo ° 2P j. In the left panel, 
the number of samples increases with the dimension, n = \j)\ogp\, whereas in 
the right panel the number of samples is fixed, n = 50,000. 


The space complexity of SparseCovTree algorithm, which involves storing m trees, is 
0{npm). This may exceed the available memory for large n,p and m. Thus, instead 
of simultaneously constructing and processing the entire m trees from the coarse level, 
we can construct and process the sub-trees of each node in the different fine levels 
separately, starting from some coarse level h € [log 2 p]. By doing so, the space complexity 
is reduced by a factor of 2 h . These modifications can only increase the probability to 
locate large entries, whereas the theoretical runtime bound of Theorem [2] increases to 
O (nmp(log 2 p — h)2 h \ operations. Particularly, in our experiments we used h = 5, 
reducing the space complexity by a factor of 32. Moreover, the value of m as stated in 
Theorem [2] is required mostly for the theoretical analysis. Since the multiplicative factor 
appearing in Lemma [3] is not sharp, and the probability of failure converges to zero as 
the dimension increases, we can in practice use a smaller number of trees m and still 
detect most large entries of S , in sub-quadratic time. Here we chose a fixed value of 
m = 20, leading to discovery of more than 99% of the large entries for all values of p 
considered. 

As for the LSH-A:NN tuning parameters, we chose the number of projections to be 10, 
hash width 4 and bucket size 3500. For the second hash size, we picked a large prime 
number 424577, whereas the number of tables was configured according to [34] with 
5 = This configuration led to more than 99% discovery rate of all large entries, in 
all tested dimensions. 

Figure [2] shows the average runtime, in logarithmic scale, for various values of p from 
1000 to 10,000. Surprisingly, all alternative solutions, apart from SparseCovTree, yield 
slower runtime performances than the direct calculation. We raise here several possible 
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Figure 3: Average run-time and misdetection rate (in percentage points) as a function 
of e, for fixed p = 2000 and n = 20000. The dashed vertical line is the critical 
value e cr it where for the population matrix R(e cr it) = /i/ 2 . 


explanations for this. First, in all methods, except SparseCovTree and direct calculation, 
to locate negative large entries of S we duplicate the input {x.;} with the opposite sign, 
thus potentially increasing the runtime by a factor of 2. Moreover, it was suggested (see 
m) that space-partitioning-based methods, such as kd- tree, DBB and DBC, may lead 
to slow runtime in high dimensions. For an empirical illustration of this issue, see [35] . 
In our case, the dimension of the search-problem is the number of samples n, which may 
indeed be very high. In contrast to theses algorithms, the LSH method was originally 
designed to cope well in high dimensions. However, to locate most of 0(pr) large entries 
of S, we tuned the LSH-A;NN algorithm to have a small misdetection probability, by 
setting d = Such small value of S led to a large number tables, thus increasing the 
runtime by a significant factor. 

As for SparseCovTree, in low dimensions direct calculation is preferable. However, for 
larger values of p, SparseCovTree is clearly faster. Moreover, the slope on a log-log scale 
of the direct approach is roughly two, whereas our method has a slope closer to 1 . 

In the above simulation, the underlying population matrix £ was exactly sparse. Next, 
we empirically study the ability of the SparseCovTree algorithm to detect the large 
entries when the underlying population covariance matrix is not perfectly sparse, but 
only approximately so. Specifically, we generated a population covariance matrix similar 
to the procedure described above, only that the previously zero entries, are now ±e. 
Figure [3] presents both the average run-time as well as the misdetection rate of our 
algorithm with m = 10, 25, 50 trees, as a function of e, for a covariance matrix of size 
p = 2000 and n = 20000 samples. Each row of £ has an average of r = 8 large entries, all 
of size p = 1. Hence, R = R(e) ~ (p — r)e 2 . The critical value e cr u at which R = p/2 is 
thus e cr n = 1/y/A{p — r ). As seen in the left panel, the run-time scales roughly linearly 
with the number of trees, is nearly constant for e < e cr it and slowly increases for larger 
values of e. The right panel shows that in accordance to our theory, the misdetection 
rate is also nearly constant as long as e < e cr n. 

Finally, we demonstrate an application of the SparseCovTree algorithm to detect near 
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Figure 4: Average runtime per single query to detect near-duplicates in a large artificial 
data set with p elements of dimension n = 40, 000, presented in a logarithmic 
scale in both axes. For SparseCovTree, we present the average runtime both 
for a near-duplicate and for a general input query. For the other two methods, 
this partition did not affect the runtime significantly. 


duplicates in a large artificial data set: Given a reference data set with p elements, 
{xj}f ) =1 , with each x,- e R”, the goal is to quickly decide, for any query vector y e R n , 
whether there exists x ? ; s.t. S7m(xj, y) is larger than a given threshold p, where Sim 
is the Cosine Similarity. In contrast to the previous simulation, here the key quantity 
of interest is the average query time, and the computational effort of the pre-processing 
stage is typically ignored. 

Specifically, we considered the following illustrative example: We first generated a 
reference set {x ? ;}^ =1 , whose p elements were all independently and uniformly distributed 
on the unit sphere S n ~ 1 . Next, we evaluated the average run-time per query in two 
scenarios: a) a general query y, uniformly distributed on S n , and hence with high 
probability, not a near-duplicate; b) a near duplicate query u, generated as follows: 


u = 


x/1 — exj + y/e 



( zTx j) x j 

(z T Xj) x i 


where z is uniformly distributed on S n ~ l and the duplicate index j is chosen uniformly 
from the set of elements \p\. 

In our simulations, the parameters e = 0.25, n = 40,000 and m = 20 trees were kept 
fixed, and we varied the size p of the reference set. For a fair comparison to LSH, we 
decreased its number of projections to 7, and set its misdetection probability per single 
projection to be 5 = 0.3. Empirically, both our method and LSH correctly detected a 
near-duplicate with probability rate larger than 99%. 

Figured] shows the average query runtime for SparseCovTree, LSH-fcNN and the direct 
approach. In both LSH-fcNN and direct approach, the average query runtime did not 
depend on whether the query was a near-duplicate or not, and hence we show only one of 
them. For SparseCovTree, in contrast, the average runtime for a general query is lower 
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than for a near-duplicate query. The reason is that for a general query our algorithm 
detects that this is not a near-duplicate by processing only the nodes at the first few 
top levels of the tree. For this problem, the LSH method is clearly preferable over direct 
calculation for all tested values of p. When the number of elements p becomes sufficiently 
large, our method becomes faster. Most importantly, the runtime curves of SparseC- 
ovTree, both for the true and false near-duplicates, seem almost constant compared to 
the other runtime curves, which are approximately linear in p. This is consistent with 
the theoretical analysis, as SparseCovTree query time is 0(n log 2 p) operations, which is 
sub-linear, whereas the direct calculation query time is 0(np ) operations, which is linear 
thus significantly larger, and LSH-fcNN query time is 0(np p poly log p) operations, for 
some constant p > 0, which is also sub-linear but not logarithmic as the SparseCovTree 
query time. 

7. Summary 

In this paper we considered the computational aspects of detecting the large entries of 
a sparse covariance matrix from given samples. We derived and theoretically analyzed 
two different algorithms for this task, under sparsity assumptions either on S itself, or 
on X and additional requirements on the underlying random variable Z and number of 
samples n. We next demonstrated the time-efficiency of the algorithms, theoretically for 
both of them and empirically only for the tree-based one. 

Our work raises several interesting questions for future research. If an oracle gave us 
the precise locations of all large entries, computing them directly would require Pl(npr) 
operations, a quantity lower by a poly log p factor from our results. This raises the 
following fundamental question: is it possible to reduce the poly-log factor, or is there 
a theoretical lower-bound on the required number of operations? In addition, in this 
article we only considered a L 2 -norm in the definition of matrix sparsity. It may be of 
interest to consider different norms, e.g. L\, which could lead to different algorithmic 
results. Finally, both algorithms require partial knowledge on the sparsity parameters. 
It is thus of interest to efficiently estimate these parameters from the data, in case they 
are unknown. 


A. Appendix - Proofs 


A.l. Proof of Lemma [2] 

Proof. For every independent run i € [m], by definition (yi)j = 0 for all j ^ Jj. Hence, 

II S k - y*|| 2 = Yl(S kj - (yi)j) 2 + £( s kj ) 2 > ^(%) 2 = II S k - S J k > || 2 (10) 

teJi j^Ji j^Ji 


where (S'l’)j = S k j for j € J; and otherwise {S^)j = 0 . Moreover, for every i e [m], 


\\S k \\ 2 -\\S k -S*\\ 2 = '£(S kj ) 2 < XI (^ i ) 2 = ||^|| 2 -||5 A; -5 fc || 2 . 


jeJi 


JGUi Ji 
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Hence 


( 11 ) 


||5 fc -5 fc || < || Sk-Sf 
By combining Eqs. (flOl) and (fill) , we obtain 

||Sfc - S k \\ < min \\S k - y*||. ( 12 ) 

ie[m] 

Since y* is an output of the sFFT algorithm, according to Eq. 0 it satisfies the following 
inequality, w.p. >2/3 

ll-S'fc — y*|| < (1 + «) min \\S k - y|| + S\\S k \\. (13) 

l|y||o<r 

Thus, by a union bound argument, w.p. > 1 — (l) m = 1 — there exists at least one 
index i for which Eq. (1131) holds. Combining this with Eq. (1121) concludes the proof. 

□ 


A.2. Proof of Theorem [Tj 

Proof. First we show that a sufficient condition for Algorithm |T| to detect all large entries 
of S is that Eq. 0 holds for all rows. Then, from Corollary[TJ the probabilistic guarantee 
will follow. Using Eq. (0 with 6 = R+ ^ M and a = 1 yields 

\\S k - Sfcll < 2 min ||5 fc -y|| + 6 \\S k \\. (14) 

||y||o<r R + y/rM 

Since S is (r, /i, R, 2 )-sparse, then \\Sk\\ < R + yfrM , which implies that 

\\S k - S k \\ < 2 min ||5 fc -y|| + e. 
l|y||o<r 


Let T fl (Sk) be the k-th row of S thresholded at level fi, 


(T^(5 fc ))j = S kj l[\S kj \ > n]. 


Each row of S has at most r large entries, thus ||T/(S/,)||o < r and from Eq. (1141) we 
obtain 


\\Sk — 5/11 < 2||Sfc 


T»(S k ) ||+6 = 2 


E ( s *j ) 2 


1/2 

+ e < 2R + e. 


Now, assume to the contrary that there exists a large entry S k j, for j € <7^(5/.), which 
the algorithm failed to detect. Then 


H-S’fc - Sfclh > \S k j\ > n 


meaning /i < 2 R + e, which contradicts the assumption. 
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As for the run-time of the algorithm, the first two steps, calculating the matrix W 
and the bound M, can be performed using only 0{np\ogp ) operations. Then, for every 
row k, the algorithm makes O(logp) queries of the sFFT algorithm, where each query 
requires 0(nr logplog((i? + y/rM)p/e )) operations. As for the last step, evaluating the 
output S requires 0(n\I\) operations. Since the size of I cannot exceed the number of 
sFFT queries multiplied by the number of operations for a single query, we conclude 
that the total number of operations is at most 0(nrp log 2 p\og{(R + y/rM)p/e)). 

□ 

A.3. Proof of Lemma [3] 

Proof. First, let us study the quantity cr 2 and the distribution of <r 2 , under both the null 
and under the alternative. 

If Ho holds, since S is (r, p, R, 2)-sparse, 


a 


.2 


Y {*k,xj) 2 < Y ( x fc,xj) 2 < r 2 < m 2 /4. 


j£l(h,i ) 


&Ms h ) 


In contrast, if Hi holds, then there is at least one entry \Skj\ > p, leading to 


cr 2 = Y ( x fc; x i) 2 > Y (xfc.Xj) 2 > p 2 ■ 


j£l(h,i) 


j£JHS k )nI(h,i) 


Therefore, the two hypotheses in Eq. ([8]) may also be written as 


Ho : cr 2 < R 2 vs. H\ : a 2 > p 2 . 
Since for every l e [m], yi ~ N(0,a 2 ), 



In m an d 0Z1 respectively, the following two concentration inequalities for a y 2 random 
variable were derived, for all m > 2 (see also m) 



Pr — >l + e < -exp (—~(e — log(l T e))! Ve>0 (15) 



VO < e < 1. 


(16) 


We use these inequalities to bound the probabilities of false alarm and misdetection: 


Pr[false alarm] = Pr <r 2 > Ho = Pr ^221 > a 2 < R 2 < Pr 

1 “4 m ~ Ao 2 ~ ~ m ~ 4 R 2 


Since 4 R 2 < p 2 , 

r Y 2 

Pr [false alarm] < Pr > 3 


m. 
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Using (fl5l) with e = 2, we obtain for m > 31og(i) 


m, 


Pr [false alarm] < exp ( — — (2 — log 3)) < 5. 


Similarly for the probability of misdetection 

Pr [misdetection] = Pr 

Using (1161) with e = 1/4, we obtain for m > 641og(i) 

Tx-2 -a 

Pr [misdetection] < Pr 


„ 2 3/r 2 

r<T- 

Hi 

= Pr 

'Xm c ?J l j2 
m 4cr 2 

2 \ 2 
a >ii 

< Pr 

r y 2 3 i 

Am ^ * 

m■ 4 


y 2 3 
m 4 


m 


< exp ( - < S- 


□ 


A. 4. Proof of Theorem [2] 

Proof. We first introduce the following notation. For each k E [p] and h E [L ], divide 
the nodes into two disjoint sets 

A h k = {( h,i ) : t E [2% I(h,i ) n J^Sk) + 0} 

B% = {(h,i):i€ [2% I(h,i) n J^Sk) = 0} 

We say that the algorithm visited or entered a node T(h,i ) while processing the trees 
with Xfc if during its execution the call FindCx^, h, i, {T/}, p) occurred. Notice that as 
the algorithm processes the trees with input in order to detect all large entries in 
the k -th row, while at level h it must visit all nodes in A h k . In contrast, to have a 
fast runtime it should avoid processing nodes in B k . Moreover, recall that the trees- 
construction stage deterministically requires 0{np logp) operations. Thus, to prove the 
run-time bound, it suffices to show that after the preprocessing stage the algorithm uses 
at most O(nrplog 2 p) operations, w.h.p. for (i) and (n), and in expectation for (in). 

Proof of (i) and (ii). Let Vjf be the event that the algorithm visited all nodes in A k 
but no nodes of B k , while processing the h -th level of the trees with Since S is 
sparse, \A^\ < r, and therefore 


U 

fcS [p], h£ [L\ 


< rpL. 


Moreover, since the algorithm detects all large entries if and only if it visits all nodes of 
all sets A k , it suffices to bound the probability of V k simultaneously occurring for all 
k E \p\ and h E [L]. 

it has at most 2r new nodes to check in the h + 1 level, unless h = L. According to 
lemma [3] with 6 = 2r piP > probability of the algorithm to enter a node in B\ or to 
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skip a node in A k . is at most 2r pL 5 • Denote by V k the complementary event of V k . This 
implies, 

1 1 


Pr 


Vu 


< 2 r ■ 


2 rpL 3 pL 3 


(17) 


Since 


Pr 


v k 

= Pr 

Vk 

v£,...,vt\ 

Pr 

i 



+ Pr 

Vk 

3i€[h-i\:T k 

Pr 

€ [h - 1] : T k 


then 


Pr 


V: 


v k \...,v. 


h-1 


Clearly PrfV^ 1 ,..., V k 1 ] < 1, and thus 
Pr 

We use Eq. (HZD to obtain 


v£ 

Vk 1 ,- 

k 

i 

> Pr 

v£ 

- Pr 

LU ' 

<S>. 

m 

l 

\ —* 

^1. 
i_i 


Pr 


V: 


Vl-..,v£- 1 ] > 1 - -(/*-!) > 1 - 


pL 3 


pL 3 pL 2 


Next, we derive, for a fixed k, 


Pr 


Vh : V£\ = Pr [V k °] • [] Pr \v£ V k °,..., V k 


h =1 


rh— 1 


> 1 - 


pL 2 


where clearly Pr[V7°] = 1. Therefore, 


Pr [V£] - Pr [V£\ 3ie[h- 1] : V k \ Pr 

3i € [h - 1] : V k 

Pr 

v 1 v 11 - 1 

v k T ■ ■ > v k 



Pr[Vfc, h : V£] = 1 - Pr[3fc, h : V k \ > 1 - ^ Pr 


k =1 


3h : V h k 


>1~P 1- 1- 


pL 2 


(18) 


To bound this expression, consider the function g(x) = (1 — x) L . Since </'(£) > 0 for 
all £ G [0,1], it readily follows that for all x € [0,1], g(x) > (1 — xL). Hence, for x = 
we obtain 


1 - 


pL 2 


- 1 pL' 


(19) 


Combining Eq. (fl8l) with Eq. (fl9l) yields 

Pr[Vfc, h : V k ] >l—p(l—{l — 


pL 


= 1--— i 

log 2 P 
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which proves (ii), and similarly, for p > 8, 


Pr [\/k,h : Vjf] >2/3 


which proves (i). 

Proof of (Hi). Let X 1 ) % be the indicator of the event that the algorithm entered the 
node while processing the tree with x*.. Recall that 


Pr 


Ki = 1 


< Pr 


<r 2 (/(M)) > 


For nodes in this probability is clearly at most one, whereas for nodes in £>/ it 
is at most <5, as m > 641og(^). By its definition, the algorithm makes at most cnm 
operations, for some absolute constant c, in each node it considers. Thus, 


E [Runtime] = E [Number of visited nodes] • cmn 
p L 2 h 

= E£5> 

k=1h=1i—1 
V L 

< (i^i + \ B k \ 6 ) • cmn 

k=1 h= 1 

< p (rL + 2 p8) ■ cmn. 

For m > 64 log p, 5 = |, and we conclude 

E [Runtime] < C'nrplog 2 p 

for some absolute constant C'. 


n, = i 


cmn 


□ 


A.5. Proof of Theorem [3] 

To prove Theorem [3] we first introduce the following auxiliary lemma. It provides ex¬ 
ponential tail bounds on the deviations of a quadratic form from its mean value, when 
the underlying random vector has non-zero mean. This lemma thus generalizes previous 
concentration bounds for quadratic forms [39] [40], which all assumed the vector has 
mean zero. Hence, it may be of independent interest. 

Lemma 4. Let Y = (Y±,... ,YnY be a sub-gaussian random vector where all Yi’s are 
independent and let K = max* ||Li||^ 2 . Let A £ R nxn be a real symmetric matrix. Then, 
for every t > 0, 


Pr [| Y t AY 


E^AY}] > t] < 6exp 


—cmin 



l)K*\\A\\yK*\\A\\J\ 


where c is a positive absolute constant. 
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Proof. First, observe that due to the independent of the T^’s, 

Y t AY - E [Y t AY] = Y a, - Eif] + Y ofV,}) - E^E Yj\. 

i i^j 

Since 

(Yi - E Y i )(Y j - EYj) = YiYj - YjEYj, - YiEYj + EYtEYj 
and A is symmetric, we obtain 

2 - El^Elj] = dij(Yi - E Yi){Yj - EYj) + 2 a^El} - E^EY-] 

iy*Aj 

= Y Oijpi - E ^.)( y i - E 1 i) + Y bi\Yi - EYi\ 
iAj * 


where 6 j = 2 - a ij^yj- Therefore, 


\Y f AY -E[Y l AY]\ < 

5 >^[lf-Elf] 

+ 

Y<Hj(Y i -'EYf)(Y j -EYj) 

+ 

YWi-EYi] 


i 


iAj 


i 


and the problem reduces to estimating the following probabilities 


p := Pr [| Y t AY - E^AYW > t] < Pr 


^a^lf-Elf]| >t/3 


+ 


( 20 ) 


Pr 


Pr 


Y M y * - EY i)( Y J - Ey i)l > */ 3 


iAi 


+ 


Y h i[Yi-^Yi]\ > i/3 


=: Pi+ P-2 + P3- 


As for the hrst term, notice that if — Elf’s are independent centered sub-exponentials, 
with 

WYf-EYfW^KmW^KAK 2 . 

Thus we can use Proposition 5.16 of [32] to obtain 

f 2 


pi < 2 exp 


-ci mm 


t- 


t 


KA Ei 4 ’ R2 


max a,-,- 


for some absolute constant ci. Next, to bound the second sum, we follow the arguments 
of [39] in the proof of the Hanson-Wright inequality to obtain 


P2 < 2 exp 

for some absolute constant C 2 . 


t z 


-C 2 nun 


t 


\K*\\A\\y K*\\A\\ 
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If E[y] = 0 or A is a diagonal matrix, then all coefficients b 7 = 0. The third term 
in Eq. (1201) then vanished, and we recover a results similar to the original Hanson- 
Wright inequality. Assume then that E Y Y 0 and A is not diagonal, which leads to 
b / 0. Since the random variables Y % — EYi’s are centered independent sub-gaussians 
with ||Yi — ¥Yi\\^ 2 < ||Ej ||^ 2 + |EY,;| < 2K, we then use Proposition 5.10 of [32] to obtain 


P3 < 2 exp 


t 2 1 

C3 A' 2 ||b||2_ 


for some absolute constant C 3 . Since ||b III < (n- 1)K 2 \\A\\ 2 F , < \\A\\ 2 f and 

max | a a | < ||A|| the lemma follows. 

□ 


Proof of Theorem 0 Let A be the n x n matrix 


A = 


1 


n — 1 


I n -ee* 

n 


where e = (1,..., 1) € M n . Since 


1 


Sij = zjAzj = - [zjAzi + ZjAzj - (zj - z j ) t A(z i - Zj 


and Yiij = EfS'y], then 

Pr [\Sij - Y,ij\ > t ] < Pr 
Pr 

Pr 


2 1 

\z\Azi - E[z-Azj]| > — 
9 f 

I z j Azj — E [zfjAzj}\ > — 


+ 

+ 


2 1 


|(zj - ZjfAizi - Zj) - E[(zj - ZjyA{z,i - zj)]\ > — 


Now, since zand z, — z j are sub-gaussian vectors with independent entries, we use 
Lemma 0] to obtain 

+2 


Pr [|5jj — Sjj| > t\ < 18 exp 


—cnnn 




t 


(n-l)iL 4 ||A||^iL 2 ||A|| 


for some absolute constant c > 0. Clearly both ||A||i? and ||A|| are bounded by 
and therefore we conclude that for t < K 2 


Pr [| Sij - Sy| > t] < 18exp ^- Ct ^ ^ 


( 21 ) 


for some absolute constant (7. For n > ^-p— log(54p 2 ), this probability is smaller than 
Then, applying a simple union-bound argument yields 


Pr [Vi, j : |Sij - < t] > -. 


( 22 ) 
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Next, assuming the event in Eq. (1221) holds, we show that for sufficiently small t, 
sparsity of E implies sparsity of S. In particular, for t < min{^^=^-j-, we show 

that S is (r, p — t, \{p — t), 2)-sparse, and moreover, that J M (E) = 

Indeed, if > /i then according to Eq. (1221) . 


I ^kj — E k j E kj S kj \ P (i t 

meaning J M (E*,) C J^_ t (S k ). Similarly, if \S k j\ > p — t then 

|S fcj -| > \S kj \ - |E fci - S kj | >p-2t>p- 

Since every element of E (in absolute value) is either larger than p or smaller than R, 
the condition \S k j\ > p — t thus implies that |Ejy| > P and hence ,7 /t (Efc) D J^ t (S k ). 
Combining these two results above yields that w.h.p. J M (E^) = and entries of 

S are large if and only if the corresponding entries of E are large. 

As for the gap, 

E (^) 2 < E (lEfcji+1) 2 = E (i s ^i+o 2 

E E (^fcf) 2 + E + 2 E |Efcj|7 

jZMXk) j<ZMx k ) 

< R 2 + (p — r)t 2 + 2 ty/p — rR = (R + y/p — rt) 2 . 

For t < 7T7==tt : we obtain 
— 2y/p— r+1 7 

E (■V) 2 <(I(/‘-‘)) 2 

— tints') 


which implies that 5 is (r, p — t,^(p — t), 2)-sparse. 


□ 


A.6. Proof of Theorem [4] 

Proof. For any n and p, let /3(n,p ) be s.t. n = log (/3(n,p)p 2 ) + 1. Notice that for n 
and p large enough, j3(n,p) > 0. By applying a union-bound argument on Eq. (12T1) we 
obtain, instead of Eq. (|25|). 

Pr [Vi, j : | Sij - E^ | < t\ > 1 - 1 . 

Clearly as n -> oo with pl ° sp —>• 0, /3(n,p) —>• oo, which implies that this probability 
converges to one. 

□ 
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